Overview

This vignette explains how to prepare your data for SigFun. Both the streamlined (sig2Fun()) and stepwise (sigCor() + GSEA + plotting) workflows require the same input format: a properly constructed SummarizedExperiment (SE). It provides a short checklist, the required columns for each component, and validation checks so that your data can be passed to either workflow without errors.

Quick Checklist

  • Expression (assays$abundance): numeric matrix, genes × samples; no row or column may be entirely NA; zeros are allowed; values should be preprocessed (e.g., log‑TPM/CPM).

  • RowData (rowData): gene annotations with required columns: ensg_id, gene_symbol, gene_biotype.

  • ColData (colData): sample info with one row per sample, where the sample_id values match the expression column names. Include ≥1 signature column (numeric or binary 0/1). Add any other covariates as needed.

  • Ontology (t2g): a data.frame with columns gs_name and ensembl_gene using the same gene ID type as rowData$ensg_id (Ensembl IDs recommended). See Ontology Database Setup in the Appendices.

Load Packages
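
The code for this step is not shown on the rendered page. A minimal sketch, assuming the packages that the chunks in this vignette call (SigFun for the demo data, dplyr for %>%, and SummarizedExperiment, S4Vectors, and DT via ::), is to check that each one is installed and attach the two whose functions are used without a namespace prefix:

```r
# Packages referenced by the code in this vignette (assumed list, taken from the calls used here)
pkgs <- c("SigFun", "SummarizedExperiment", "S4Vectors", "dplyr", "DT")

# Report anything that is not installed instead of failing halfway through
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Please install: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages whose functions are called unqualified below
for (p in intersect(c("SigFun", "dplyr"), pkgs[installed])) {
  library(p, character.only = TRUE)
}
```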

Input Data Requirements

SummarizedExperiment Structure

SigFun expects an SE with three essentials:

  • Assay: expression matrix (genes × samples), stored as assays$abundance

  • RowData: gene annotations (must align row‑wise with the assay)

  • ColData: sample information (must align column‑wise with the assay)

Assay: Expression Data

Format

  • Rows = genes; Columns = samples; Values = numeric expression.

  • Zeros are allowed; no row or column may be entirely NA, and the matrix must not contain character values.

  • If using raw counts, consider normalization and optional log transform.
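
If you start from raw counts, one common preprocessing choice is log2 counts per million. The sketch below uses base R on a toy matrix; the gene and sample names, the count values, and the pseudocount of 1 are all illustrative assumptions, not SigFun requirements.

```r
# Toy raw-count matrix (hypothetical values): 3 genes x 2 samples
counts <- matrix(
  c(10, 0, 90,
    20, 30, 50),
  nrow = 3,
  dimnames = list(c("ENSG_A", "ENSG_B", "ENSG_C"), c("S1", "S2"))
)

# Scale each sample (column) by its library size to counts per million
lib_size <- colSums(counts)
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

# Log-transform with a pseudocount so that zeros stay finite
log_cpm <- log2(cpm + 1)
```

The result keeps the genes × samples layout and the dimnames of the input, so it can be used directly as the abundance assay.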

# Expression data
data("expr.data")
dim(expr.data)
#> [1] 17341   127
expr.data %>% DT::datatable(options = list(pageLength = 10, scrollX = TRUE))

RowData: Gene Annotations

Required columns:

  • ensg_id — Ensembl Gene ID (ENSG...)
  • gene_symbol — HGNC gene symbol (e.g., TP53)
  • gene_biotype — e.g., protein_coding, lncRNA

Important: Genes must be in the same order as the rows of expr.data, and rownames(rowData) must equal ensg_id.

# Gene annotations
data("mapping")
# ensure rownames
rownames(mapping) <- mapping$ensg_id
mapping %>% DT::datatable(options = list(pageLength = 10, scrollX = TRUE))
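
If your annotation table is not already in the same order as the expression rows, match() can reorder it. This is a minimal base R sketch on toy objects (expr_toy and annot_toy are hypothetical names and values), not part of the SigFun API:

```r
# Toy expression matrix whose rows are in a different order than the annotations
expr_toy <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("ENSG3", "ENSG1", "ENSG2"), c("S1", "S2"))
)
annot_toy <- data.frame(
  ensg_id      = c("ENSG1", "ENSG2", "ENSG3"),
  gene_symbol  = c("A", "B", "C"),
  gene_biotype = "protein_coding"
)

# Position of each expression row in the annotation table
idx <- match(rownames(expr_toy), annot_toy$ensg_id)
stopifnot(!anyNA(idx))  # every expressed gene needs an annotation row

# Reorder the annotations and set Ensembl IDs as row names
annot_toy <- annot_toy[idx, , drop = FALSE]
rownames(annot_toy) <- annot_toy$ensg_id
aligned <- identical(rownames(annot_toy), rownames(expr_toy))
```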

ColData: Sample Information & Signatures

Provide sample‑level metadata with at least one signature column. You may include multiple signatures (one per column). For binary signatures used with cor.method = "logit", code as 0/1.

Required columns:

  • sample_id — must match column names of the expression matrix

  • ≥1 signature column — name freely chosen (e.g., my_signature), numeric or 0/1

Note: Neither sample_id nor the signature column(s) (value in the demo data) may contain missing values (NA).

# Sample metadata & signatures
data("SIG_MAT")
dim(SIG_MAT)
#> [1] 127   2
SIG_MAT %>% DT::datatable(options = list(pageLength = 10, scrollX = TRUE), rownames = FALSE)

Note: The signature column does not have to be named value; SigFun reads the signature(s) from colData internally. Keep names informative (e.g., Tcell_score, EMT_binary).
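
The requirements above (0/1 coding, no NA, sample IDs matching the expression columns) can be sketched in base R as follows. The table, the group labels, and the EMT_binary column are toy assumptions for illustration only.

```r
# Toy sample table in a different order than the expression columns
sample_toy <- data.frame(
  sample_id = c("S2", "S1", "S3"),
  group     = c("tumor", "normal", "tumor"),
  stringsAsFactors = FALSE
)
expr_cols <- c("S1", "S2", "S3")  # stands in for colnames(expr_matrix)

# Encode a two-level label as a 0/1 signature
sample_toy$EMT_binary <- as.integer(sample_toy$group == "tumor")

# Reorder rows to follow the expression columns
sample_toy <- sample_toy[match(expr_cols, sample_toy$sample_id), , drop = FALSE]
rownames(sample_toy) <- sample_toy$sample_id

# Check the requirements: no NA and strictly 0/1 values
stopifnot(
  !anyNA(sample_toy$sample_id),
  !anyNA(sample_toy$EMT_binary),
  all(sample_toy$EMT_binary %in% c(0L, 1L))
)
```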

(Optional) Pre‑computed cor.df for Custom Ranking

Advanced users can attach a custom ranking table to the SE metadata for downstream enrichment and plotting:

# Example shape: data.frame(gene = ensg_id, cor = numeric, pval = numeric, ...)
meta <- S4Vectors::metadata(your_SE)
meta$cor.df <- your_precomputed_cordf
S4Vectors::metadata(your_SE) <- meta
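
As an illustration of how such a table could be produced, the sketch below computes a per-gene Spearman correlation against a signature with base R. The column names gene, cor, and pval follow the example shape in the comment above; the toy data, the choice of Spearman correlation, and the name cor_df_toy are assumptions, and SigFun may expect additional columns.

```r
set.seed(1)

# Toy data: 4 genes x 10 samples and one continuous signature
expr_toy <- matrix(
  rnorm(40), nrow = 4,
  dimnames = list(paste0("ENSG", 1:4), paste0("S", 1:10))
)
signature <- rnorm(10)

# One correlation test per gene, collected into a data.frame
res <- lapply(rownames(expr_toy), function(g) {
  ct <- stats::cor.test(expr_toy[g, ], signature, method = "spearman")
  data.frame(gene = g, cor = unname(ct$estimate), pval = ct$p.value)
})
cor_df_toy <- do.call(rbind, res)
```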

Building Your SummarizedExperiment

From your data (template)

# Step 1: Prepare your matrices/data frames
expr_matrix <- as.matrix(your_expression_data) # genes × samples
gene_info <- your_gene_annotations # contains ensg_id, gene_symbol, gene_biotype
sample_info <- your_sample_metadata_with_signatures

# Step 2: Align & validate
stopifnot(
  is.matrix(expr_matrix),   # must be a matrix, not a data.frame
  is.numeric(expr_matrix),  # every value numeric (no character entries)
  nrow(expr_matrix) == nrow(gene_info),
  ncol(expr_matrix) == nrow(sample_info),
  # Column names of expression must match sample_id, in the same order
  identical(colnames(expr_matrix), as.character(sample_info$sample_id)),
  # Row names of expression must match ensg_id, in the same order
  identical(rownames(expr_matrix), as.character(gene_info$ensg_id)),
  # No rows should be entirely NA
  !any(rowSums(is.na(expr_matrix)) == ncol(expr_matrix)),
  # No columns should be entirely NA
  !any(colSums(is.na(expr_matrix)) == nrow(expr_matrix))
)


# Ensure row names for rowData are Ensembl IDs
gene_info <- as.data.frame(gene_info)
rownames(gene_info) <- gene_info$ensg_id


# Step 3: Construct SE
your_SE <- SummarizedExperiment::SummarizedExperiment(
  assays = list(abundance = expr_matrix),
  rowData = S4Vectors::DataFrame(gene_info),
  colData = S4Vectors::DataFrame(sample_info)
)


# Step 4: Inspect
your_SE

From bundled demo data (runnable)

SE_demo <- SummarizedExperiment::SummarizedExperiment(
  assays = list(abundance = as.matrix(expr.data)),
  rowData = S4Vectors::DataFrame(mapping, row.names = mapping$ensg_id),
  colData = S4Vectors::DataFrame(SIG_MAT)
)
SE_demo
#> class: SummarizedExperiment 
#> dim: 17341 127 
#> metadata(0):
#> assays(1): abundance
#> rownames(17341): ENSG00000288642 ENSG00000288611 ... ENSG00000000005
#>   ENSG00000000003
#> rowData names(3): ensg_id gene_symbol gene_biotype
#> colnames(127): 112606 122287 ... 995480 999264
#> colData names(2): sample_id value

Sanity checks (optional but recommended)

# Extract each component once to keep the checks readable
abund <- SummarizedExperiment::assay(SE_demo, "abundance")
row_data <- SummarizedExperiment::rowData(SE_demo)
col_data <- SummarizedExperiment::colData(SE_demo)

stopifnot(
  is.matrix(abund),
  is.numeric(abund),
  # No rows should be entirely NA
  all(rowSums(is.na(abund)) < ncol(abund)),
  # No columns should be entirely NA
  all(colSums(is.na(abund)) < nrow(abund)),
  # Row names must match ensg_id column
  all(rownames(row_data) == row_data$ensg_id),
  # Column names must match sample identifiers in colData
  all(colnames(abund) == rownames(col_data)),
  # Dimensions must match
  nrow(abund) == nrow(row_data),
  ncol(abund) == nrow(col_data)
)
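
stopifnot() halts at the first failed condition. If you would rather see every problem at once before building the SE, the checks can be wrapped in a small function that collects messages. The function below, check_se_inputs(), is a hypothetical helper written for this vignette and is not part of SigFun.

```r
# Hypothetical helper: returns a character vector of problems (empty if none)
check_se_inputs <- function(expr, gene_info, sample_info) {
  problems <- character(0)
  if (!is.matrix(expr) || !is.numeric(expr)) {
    return("expression must be a numeric matrix")
  }
  required <- c("ensg_id", "gene_symbol", "gene_biotype")
  missing_cols <- setdiff(required, colnames(gene_info))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("gene annotations lack:",
                                  paste(missing_cols, collapse = ", ")))
  }
  if (!identical(rownames(expr), as.character(gene_info$ensg_id))) {
    problems <- c(problems, "expression row names differ from ensg_id")
  }
  if (!identical(colnames(expr), as.character(sample_info$sample_id))) {
    problems <- c(problems, "expression column names differ from sample_id")
  }
  if (any(rowSums(is.na(expr)) == ncol(expr))) {
    problems <- c(problems, "at least one row is entirely NA")
  }
  if (any(colSums(is.na(expr)) == nrow(expr))) {
    problems <- c(problems, "at least one column is entirely NA")
  }
  problems
}

# Usage on toy inputs (hypothetical values)
genes_toy <- data.frame(ensg_id = c("ENSG1", "ENSG2"),
                        gene_symbol = c("A", "B"),
                        gene_biotype = "protein_coding")
samples_toy <- data.frame(sample_id = c("S1", "S2"), value = c(0.2, 1.5))
good <- matrix(c(1, 2, 3, 4), nrow = 2,
               dimnames = list(c("ENSG1", "ENSG2"), c("S1", "S2")))
bad <- good
bad["ENSG2", ] <- NA  # make one gene entirely NA

problems_good <- check_se_inputs(good, genes_toy, samples_toy)
problems_bad <- check_se_inputs(bad, genes_toy, samples_toy)
```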

Appendices

Ontology Database Setup

# Prepare gene ontology data. We strongly recommend MSigDB.
# The following code builds the t2g table from MSigDB collections via msigdbr.
if (!requireNamespace("msigdbr", quietly = TRUE)) {
    install.packages("msigdbr")
}
library(msigdbr)
# Download human hallmark gene sets
H_t2g <- msigdbr::msigdbr(species = "Homo sapiens", category = "H") %>% 
    dplyr::select(gs_name, ensembl_gene)

# Download human C2 pathway information
C2_t2g <- msigdbr::msigdbr(species = "Homo sapiens", category = "C2") %>% 
    dplyr::filter(gs_subcat %in% c("CP:BIOCARTA","CP:KEGG_MEDICUS",
        "CP:REACTOME","CP:WIKIPATHWAYS")) %>% 
    dplyr::select(gs_name, ensembl_gene)
# Download human C5 ontology information
C5_t2g <- msigdbr::msigdbr(species = "Homo sapiens", category = "C5") %>% 
    dplyr::filter(gs_subcat %in% c("GO:BP","GO:CC","GO:MF")) %>% 
    dplyr::select(gs_name, ensembl_gene)

# Combine them
t2g <- rbind(H_t2g, C2_t2g, C5_t2g)
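
Because t2g must use the same gene ID type as rowData$ensg_id, it can help to check how many ontology genes are actually present in your data before running enrichment. This is a base R sketch on toy objects (t2g_toy and ensg_toy are hypothetical stand-ins for t2g and the ensg_id column):

```r
# Toy ontology table and gene universe
t2g_toy <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B"),
  ensembl_gene = c("ENSG1", "ENSG2", "ENSG9")
)
ensg_toy <- c("ENSG1", "ENSG2", "ENSG3")

# Required columns must be present
stopifnot(all(c("gs_name", "ensembl_gene") %in% colnames(t2g_toy)))

# Fraction of ontology entries whose gene is found in the expression data
covered <- t2g_toy$ensembl_gene %in% ensg_toy
overlap_fraction <- mean(covered)

# Number of matched genes per gene set; sets with 0 cannot be tested
genes_per_set <- tapply(covered, t2g_toy$gs_name, sum)
```

An overlap close to zero usually means the two tables use different identifier types (for example, gene symbols on one side and Ensembl IDs on the other).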

Session Information

#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-pc-linux-gnu
#> Running under: CentOS Stream 9
#> 
#> Matrix products: default
#> BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Asia/Taipei
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4      SigFun_0.99.11   BiocStyle_2.34.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3                   rlang_1.1.6                
#>   [3] magrittr_2.0.3              DOSE_4.0.1                 
#>   [5] ggridges_0.5.7              matrixStats_1.5.0          
#>   [7] compiler_4.4.3              RSQLite_2.4.3              
#>   [9] png_0.1-8                   systemfonts_1.2.3          
#>  [11] vctrs_0.6.5                 reshape2_1.4.4             
#>  [13] stringr_1.5.2               pkgconfig_2.0.3            
#>  [15] crayon_1.5.3                fastmap_1.2.0              
#>  [17] backports_1.5.0             XVector_0.46.0             
#>  [19] rmarkdown_2.29              UCSC.utils_1.2.0           
#>  [21] ragg_1.3.3                  purrr_1.1.0                
#>  [23] bit_4.6.0                   xfun_0.53                  
#>  [25] zlibbioc_1.52.0             cachem_1.1.0               
#>  [27] aplot_0.2.8                 GenomeInfoDb_1.42.3        
#>  [29] jsonlite_2.0.0              blob_1.2.4                 
#>  [31] DelayedArray_0.32.0         BiocParallel_1.40.2        
#>  [33] broom_1.0.10                parallel_4.4.3             
#>  [35] R6_2.6.1                    bslib_0.9.0                
#>  [37] stringi_1.8.7               RColorBrewer_1.1-3         
#>  [39] car_3.1-3                   GenomicRanges_1.58.0       
#>  [41] jquerylib_0.1.4             GOSemSim_2.32.0            
#>  [43] Rcpp_1.1.0                  bookdown_0.44              
#>  [45] SummarizedExperiment_1.36.0 knitr_1.50                 
#>  [47] ggtangle_0.0.7              R.utils_2.13.0             
#>  [49] IRanges_2.40.1              igraph_2.1.4               
#>  [51] Matrix_1.7-4                splines_4.4.3              
#>  [53] tidyselect_1.2.1            qvalue_2.38.0              
#>  [55] rstudioapi_0.17.1           abind_1.4-8                
#>  [57] yaml_2.3.10                 codetools_0.2-20           
#>  [59] lattice_0.22-7              tibble_3.3.0               
#>  [61] plyr_1.8.9                  treeio_1.30.0              
#>  [63] Biobase_2.66.0              KEGGREST_1.46.0            
#>  [65] S7_0.2.0                    evaluate_1.0.5             
#>  [67] gridGraphics_0.5-1          desc_1.4.3                 
#>  [69] Biostrings_2.74.1           ggtree_3.14.0              
#>  [71] ggpubr_0.6.1                pillar_1.11.0              
#>  [73] BiocManager_1.30.26         carData_3.0-5              
#>  [75] MatrixGenerics_1.18.1       DT_0.34.0                  
#>  [77] stats4_4.4.3                ggfun_0.2.0                
#>  [79] generics_0.1.4              S4Vectors_0.44.0           
#>  [81] ggplot2_4.0.0               tidytree_0.4.6             
#>  [83] scales_1.4.0                glue_1.8.0                 
#>  [85] lazyeval_0.2.2              tools_4.4.3                
#>  [87] data.table_1.17.8           ggsignif_0.6.4             
#>  [89] fgsea_1.32.4                forcats_1.0.0              
#>  [91] fs_1.6.6                    fastmatch_1.1-6            
#>  [93] cowplot_1.2.0               grid_4.4.3                 
#>  [95] ape_5.8-1                   tidyr_1.3.1                
#>  [97] crosstalk_1.2.2             AnnotationDbi_1.68.0       
#>  [99] nlme_3.1-168                patchwork_1.3.2            
#> [101] GenomeInfoDbData_1.2.13     Formula_1.2-5              
#> [103] cli_3.6.5                   rappdirs_0.3.3             
#> [105] textshaping_1.0.3           viridisLite_0.4.2          
#> [107] S4Arrays_1.6.0              gtable_0.3.6               
#> [109] rstatix_0.7.2               R.methodsS3_1.8.2          
#> [111] yulab.utils_0.2.1           sass_0.4.10                
#> [113] digest_0.6.37               BiocGenerics_0.52.0        
#> [115] ggrepel_0.9.6               SparseArray_1.6.2          
#> [117] ggplotify_0.1.2             htmlwidgets_1.6.4          
#> [119] farver_2.1.2                memoise_2.0.1              
#> [121] htmltools_0.5.8.1           pkgdown_2.1.3              
#> [123] R.oo_1.27.1                 lifecycle_1.0.4            
#> [125] httr_1.4.7                  GO.db_3.20.0               
#> [127] bit64_4.6.0-1